Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support format_number #9281

Merged
merged 22 commits into from
Sep 29, 2023
Merged

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Sep 21, 2023

Partly supported #9173

This PR support format_number for integral and decimal type. Float/double support is still wip, I plan to refactor this part with a string operations based solution to avoid precision error.

And for float/double type, the results will be mismatch between spark and plugin for large/small numbers.

It is because we should first convert a float/double to string correctly before formatting it, but in plugin casting float/double to string doesn't match Spark/Java's result, see compatibility doc. We may need a custom kernel for float to string casting, see #4204.

The solution is quite long and calls many times of cuDF API, which may make it slower than excepted. I did some performance test, it ran faster than CPU.

performance test results

10000000 random number generated by BigDataGen:

val dataTable = DBGen().addTable("data", "a {{{type}}}", 10000000)
dataTable.toDF(spark).write.mode("overwrite").parquet("{{{type}}}_for")

test code:

spark.time(df.selectExpr("COUNT(format_number(a, -1)) as a", "COUNT(format_number(a, 0)) as b", "COUNT(format_number(a, 5)) as c", "COUNT(format_number(a, 50)) as d").show())
Data Type GPU Time (ms) CPU Time (ms)
double 11016 22905
float 1464 10307
int 2743 5127
short 545 3967
byte 536 3481
long 3251 7002

@thirtiseven thirtiseven changed the title WIP: Support format_number Support format_number Sep 25, 2023
@thirtiseven thirtiseven self-assigned this Sep 25, 2023
@thirtiseven thirtiseven marked this pull request as ready for review September 25, 2023 14:43
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only got part way through. I'll finish the review when it is no longer in draft.

@sameerz sameerz added the feature request New feature or request label Sep 25, 2023
@sameerz
Copy link
Collaborator

sameerz commented Sep 25, 2023

I did some performance test on long type, it ran faster than CPU. I will do more tests and update soon.

Can you please add the performance test results here?

}
}
(intPartExp, decPartExp)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If execption occurs at line 2371, then intPartExp will be leaked.
Should handle this resource pair as a whole.

      withResource(ArrayBuffer.empty[AutoCloseable]) { resource_array =>
        
        var res1 = genRes()
        resource_array ++= res1
        
        var res2 = genRes()
        resource_array ++= res2
        
      }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated all such cases I can founded.

}
}
(intPart, decPart)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle resource pair as a whole.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to go through this, but the complexity, especially for the float/double code is rather hard to follow. Also in my own testing I saw a lot of problems with the float/double results. It looks like it is related to rounding and to the amount of precision we can get as output from casting a float to a string using CUDF. I'm not sure if we can fix that without a custom kernel.

Copy link
Collaborator

@firestarman firestarman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comment for the processing will be better for review.

@thirtiseven
Copy link
Collaborator Author

I did some performance test on long type, it ran faster than CPU. I will do more tests and update soon.

Can you please add the performance test results here?

Ok, updated them in the PR description.

}
val substrs = closeOnExcept(sepCol) { _ =>
(0 until maxstrlen by 3).map { i =>
Copy link
Collaborator

@firestarman firestarman Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:
I think we do not need to reverse and reverse back the input string if we slice strings from end to start, something like

var curEndsCol: ColumnVector = strlen

val substrs = withResource(curEndsCol) { _ =>
  (0 until maxstrlen by 3).safeMap { _ =>
    val startCol = curEndsCol - 3,
    val sub = closeOnExcept(startCol) { _ =>
      str.substring(startCol, curEndsCol)
    }
    curEndsCol.close()
    curEndsCol = startCol
    sub
  }.reverse
}

You need to do strlen - 3 in columnar way, e.g.

withResource(Scalar.fromInt(3)) { scalar3 =>
   strlen.sub(scalar3)
}

This is just another option, not sure if it would have better perf.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very close.

@revans2
Copy link
Collaborator

revans2 commented Sep 27, 2023

Sorry that this is late, but I am seeing some errors with the latest code for decimal. It looks like the rounding is off in some cases.

In the spark shell I ran

import org.apache.spark.sql.DataFrame

def compare(left: DataFrame, right: DataFrame): DataFrame = {
  val leftCount = left.groupBy(left.columns.map(col(_)): _*).count
  val rightCount = right.groupBy(right.columns.map(col(_)): _*).count
  val joinOn = leftCount.columns.map(c => leftCount(c) <=> rightCount(c)).reduceLeft(_ and _)
  val onlyRight = rightCount.join(leftCount, joinOn, joinType="left_anti").withColumn("_in_column", lit("right"))
  val onlyLeft = leftCount.join(rightCount, joinOn, joinType="left_anti").withColumn("_in_column", lit("left"))
  onlyRight.union(onlyLeft)
}

spark.conf.set("spark.rapids.sql.enabled", false)
spark.time(spark.range(100000000L).selectExpr("*", "format_number(1 / CAST(id AS DECIMAL(38,0)), 4) as fnid", "1 / CAST(id as DECIMAL(38, 0))").write.mode("overwrite").parquet("/data/tmp/TEST_OUT_CPU"))
spark.conf.set("spark.rapids.sql.enabled", true)
spark.time(spark.range(100000000L).selectExpr("*", "format_number(1 / CAST(id AS DECIMAL(38,0)), 4) as fnid", "1 / CAST(id as DECIMAL(38, 0))").write.mode("overwrite").parquet("/data/tmp/TEST_OUT"))
spark.time(compare(spark.read.parquet("/data/tmp/TEST_OUT"), spark.read.parquet("/data/tmp/TEST_OUT_CPU")).orderBy("id", "_in_column").show(false))

It produced the following

+-----+------+---------------------------------------+-----+----------+         
|id   |fnid  |(1 / CAST(id AS DECIMAL(38,0)))        |count|_in_column|
+-----+------+---------------------------------------+-----+----------+
|32   |0.0313|0.0312500000000000000000000000000000000|1    |left      |
|32   |0.0312|0.0312500000000000000000000000000000000|1    |right     |
|160  |0.0063|0.0062500000000000000000000000000000000|1    |left      |
|160  |0.0062|0.0062500000000000000000000000000000000|1    |right     |
|800  |0.0013|0.0012500000000000000000000000000000000|1    |left      |
|800  |0.0012|0.0012500000000000000000000000000000000|1    |right     |
|4000 |0.0003|0.0002500000000000000000000000000000000|1    |left      |
|4000 |0.0002|0.0002500000000000000000000000000000000|1    |right     |
|20000|0.0001|0.0000500000000000000000000000000000000|1    |left      |
|20000|0.0000|0.0000500000000000000000000000000000000|1    |right     |
+-----+------+---------------------------------------+-----+----------+

I compared the results to bround, and it looks like we have a bug in there somewhere. I'll file a separate issue for that. Just giving you a heads up.

val numberToRoundStr = withResource(zeroPointCv) { _ =>
withResource(leadingZeros) { _ =>
ColumnVector.stringConcatenate(Array(zeroPointCv, leadingZeros, intPart, decPart))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: scalar version is better.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Sep 28, 2023

@revans2 Thanks for the review, I think I fixed the memory issues.

I'm not sure we can fix this without a custom kernel.

I don't think we can fully match Spark's result on the plugin side for float/decimal yet, considering these cuDF issues. This PR can produce correct results with limited precision and aims to require minimal change to fully support double/float when float to string in JNI is ready.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@@ -3102,6 +3102,12 @@ object GpuOverrides extends Logging {
s" ${RapidsConf.ENABLE_FLOAT_FORMAT_NUMBER} to true.")
}
}
case dt: DecimalType => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once rapidsai/cudf#14210 is fixed we should come back and retest to be sure that it is working properly.

@revans2
Copy link
Collaborator

revans2 commented Sep 28, 2023

build

@thirtiseven
Copy link
Collaborator Author

Thanks all for the review and help! merging this...

@thirtiseven thirtiseven merged commit 7bffb16 into NVIDIA:branch-23.10 Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants